An explicit flow solver, applicable to the hierarchy of model equations ranging from
Euler  to  full  Navier-Stokes,  is  combined  with  several  techniques  designed  to  reduce 
computational  expense.  The  computational  domain  consists  of  local  grid  refinements 
embedded  in  a  global  coarse  mesh,  where  the  locations  of  these  refinements  are  defined  by 
the  physics  of  the  flow.  Flow  characteristics  are  also  used  to  determine  which  set  of  model 
equations  is  appropriate  for  solution  in  each  region,  thereby  reducing  not  only  the  number 
of  grid  points  at  which  the  solution  must  be  obtained,  but  also  the  computational  effort 
required  to  get  that  solution.  Acceleration  to  steady-state  is  achieved  by  applying 
multigrid on each of the subgrids, regardless of the particular model equations being
solved.  Since  each  of  these  components  is  explicit,  advantage  can  readily  be  taken  of  the 
vector- and parallel-processing capabilities of machines such as the Cray X-MP and
Cray-2.


1.  Introduction 


It  is  generally  recognised  that  a  comprehensive  approach  to  the  simulation  of  flows 
involving  both  complex  geometries  and  complex  physics  will  require  powerful  advanced- 
architecture  supercomputers  with  very  large  memories.  Machines  capable  of  producing 
solutions to Reynolds-averaged Navier-Stokes flows over complex geometries within
computing times short enough to be of design interest are expected to be available by the
end of this decade [1]. In order to use these parallel-processing supercomputers
effectively,  algorithms  must  be  adapted  to  focus  the  power  of  multiple  processing  units  on 
a  single  flow  simulation.  Furthermore,  the  history  of  computational  aerodynamics 
teaches that the pace of progress in this field is set by the synergism between improved
computers  and  better  algorithms.  In  the  past  15  years,  improved  computers  have  reduced 
the  cost  of  computation  by  a  factor  of  about  100.  Over  the  same  period,  better 
algorithms  have  reduced  the  cost  of  computation  on  a  given  computer  by  a  factor  of 
almost  1000  [2].  Thus,  it  is  to  be  expected  that  the  need  for  faster  algorithms  will  not  be 
diminished  by  the  availability  of  faster  and  larger  computers. 

The  most  popular  algorithms  presently  in  use  for  calculating  three-dimensional 
Navier-Stokes  flows  are  Beam-Warming  [3]  (or  similar  implicit  methods)  and  two  types  of 
explicit schemes, Runge-Kutta [4] and Lax-Wendroff [5]. Implicit schemes are highly
efficient  on  uniprocessor  machines,  and  may  even  be  adapted  to  parallel  computers  with  a 
small  number  of  processors  and  shared  memory  [6].  However,  as  shown  by  Bruno  [7],  they 
are  extremely  sensitive  to  the  size  and  location  of  memory  in  large  multiprocessor 
systems.  Runge-Kutta  and  Lax-Wendroff  methods,  on  the  other  hand,  being  explicit, 
map  readily  onto  parallel  architectures.  The  authors  chose  to  use  MacCormack’s  method 
in  the  present  work  because  of  its  robustness  and  their  experience  with  it,  although 
another  explicit  scheme,  such  as  Runge-Kutta,  could  be  used  in  its  place.  The  approach 
selected  enhances  the  efficiency  of  the  MacCormack  scheme  by  implementing  it  on  a 
collection  of  local  meshes  embedded  in  a  global  mesh.  Either  the  Euler,  thin-layer 
Navier-Stokes  or  full  Navier-Stokes  equations  are  solved  on  designated  meshes.  The 
choice  of  model  equations  is  determined  by  the  nature  of  the  flow  physics  to  be  resolved 
on  a  particular  mesh.  When  the  requirement  for  time  accuracy  is  relaxed,  a  convergence 
acceleration  procedure  is  applied  simultaneously  to  all  meshes  and  all  model  equations. 
The  entire  algorithm  is  explicit  and  is  designed  to  perform  well  on  computers  consisting  of 
multiple  processing  units,  each  having  vector  processing  capability.  Examples  of  such 
machines are the Cray X-MP and Cray-2.
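
To make the explicit character of MacCormack's method concrete, the following is a minimal sketch for the one-dimensional linear advection equation u_t + a u_x = 0 on a periodic grid. It is an illustration only and not the authors' three-dimensional solver: the function name `maccormack_step`, the periodic boundary treatment and the Courant number `c = a*dt/dx` are assumptions introduced here. Each new value depends only on values from the previous stage, which is the property that lets explicit schemes vectorize and split across processors.

```python
def maccormack_step(u, c):
    """Advance u by one step of MacCormack's predictor-corrector scheme.

    u : list of values on a periodic 1-D grid (illustrative assumption)
    c : Courant number a*dt/dx for the model equation u_t + a*u_x = 0
    """
    n = len(u)
    # Predictor: forward difference applied to the old solution.
    pred = [u[i] - c * (u[(i + 1) % n] - u[i]) for i in range(n)]
    # Corrector: backward difference applied to the predicted solution,
    # averaged with the old solution (pred[i - 1] wraps around at i = 0).
    return [0.5 * (u[i] + pred[i] - c * (pred[i] - pred[i - 1]))
            for i in range(n)]


# At a Courant number of one the scheme shifts the profile by one cell.
pulse = [0.0, 0.0, 1.0, 0.0, 0.0]
shifted = maccormack_step(pulse, 1.0)
print(shifted)
```

Because the two list comprehensions have no dependence between different values of `i`, either stage could be divided among processors without changing the result.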


All  of  the  above-mentioned  elements  of  the  algorithm  have  been  integrated  into  a  fully 
three-dimensional  Navier-Stokes  flow  solver,  the  performance  of  which  is  being  evaluated. 
This  task  requires  a  very  large  memory,  high-speed  computer  such  as  the  Cray-2.  It  has 
256  million  words  of  shared  memory  and  four  vector  CPUs  and  is  the  principal  machine 
being  used  for  development  and  testing  of  the  scheme.  However,  until  mature,  reliable 
multitasking  software  is  available  on  the  Cray-2,  this  aspect  of  the  development  will  be 
continued  on  the  X-MP,  which  is  also  a  four-processor  system.  Since  the  X-MP  presently 
has  at  most  16  million  words  of  primary  memory,  the  problem  has  to  be  scaled  back  by 
decreasing  the  number  of  grid  points.  It  is  conjectured  that  parallel-processing  efficiency 
degrades for smaller problems, so that the X-MP will provide a lower bound on expected
performance  of  the  full-scale  simulation. 

2.  Equations  of  Motion 

The  nondimensional  equations  of  motion  may  be  written  in  conservation-law  form  as 

$$U_t = -(F_x + G_y + H_z)$$

where,  for  the  Reynolds-averaged  Navier-Stokes  equations, 

$$F = f - Re^{-1} p, \qquad G = g - Re^{-1} r, \qquad H = h - Re^{-1} s$$

while,  for  their  thin-layer  version, 

$$F = f, \qquad G = g, \qquad H = h - Re^{-1} d$$

and,  for  the  Euler  equations, 

$$F = f, \qquad G = g, \qquad H = h$$

where: 

Here $\rho$, u, v, w, p and E are, respectively, density, velocity components in the x-, y- and z-
directions, pressure and total energy per unit volume. This final quantity may be
expressed as


$$E = \rho \left[ e + \tfrac{1}{2} \left( u^2 + v^2 + w^2 \right) \right]$$


where  the  specific  internal  energy,  e,  is  related  to  the  pressure  and  density  by  the  simple 
law  of  a  calorically-perfect  gas 


$$p = (\gamma - 1) \rho e$$


with $\gamma$ denoting the ratio of specific heats. The coefficient of thermal conductivity, k, and
the viscosity coefficients, $\lambda$ and $\mu$, are assumed to be functions only of temperature.
Furthermore, $\lambda$ is expressed in terms of the dynamic viscosity, $\mu$, by invoking Stokes'
assumption of zero bulk viscosity. Re and Pr denote the Reynolds and Prandtl numbers,
respectively. Although, for simplicity, the equations of motion are presented here
in Cartesian coordinates, it is well known that their strong conservation-law form may be
maintained under an arbitrary space- and time-dependent transformation of coordinates.
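
As a small worked example of the closure relations in this section, the sketch below recovers the pressure from the conserved quantities, using total energy per unit volume E = rho*(e + (u^2 + v^2 + w^2)/2) and the perfect-gas law p = (gamma - 1)*rho*e. The function name `pressure_from_conserved` and the default `gamma = 1.4` (air) are assumptions made for illustration; they do not come from the original code.

```python
def pressure_from_conserved(rho, u, v, w, E, gamma=1.4):
    """Pressure of a calorically perfect gas from density, velocity and
    total energy per unit volume (nondimensional or consistent units)."""
    kinetic = 0.5 * (u * u + v * v + w * w)  # kinetic energy per unit mass
    e = E / rho - kinetic                    # specific internal energy
    return (gamma - 1.0) * rho * e


# Round trip: build E from a chosen state, then recover the pressure.
rho, u, v, w, gamma = 1.0, 0.5, 0.0, 0.0, 1.4
p = 1.0 / gamma
e = p / ((gamma - 1.0) * rho)
E = rho * (e + 0.5 * (u * u + v * v + w * w))
p_back = pressure_from_conserved(rho, u, v, w, E, gamma)
print(p, p_back)
```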

3.  Algorithm  Strategy

In  order  to  minimize  the  cost  of  simulating  complex,  three-dimensional  viscous  flow 
over  complete  configurations,  the  following  strategy  has  been  developed: 

a. Use a robust and flexible explicit flow solver capable of simulating either steady or
unsteady flow with the Euler, thin-layer Navier-Stokes or full Navier-Stokes
equations.

b.  Distribute  grid  points  optimally  by  making  use  of  both  grid  stretching  and  locally- 
embedded  grid  refinements. 

c. Make use of a zonal flow simulation strategy ranging from the Euler equations through
the full Navier-Stokes equations in order to minimize the computational work per
grid point.

d.  Accelerate  the  convergence  of  steady  flow  simulations  by  means  of  an  explicit 
multigrid  technique  which  may  be  applied,  without  modification,  to  the  entire 
hierarchy  of  model  equations.  Use  additional  convergence  acceleration  methods,  such 
as  residual  averaging,  as  appropriate. 

e.  Take  advantage  of  the  explicit  nature  of  the  algorithm  by  mapping  it  onto  a 
supercomputer  architecture  consisting  of  multiple  vector-processing  CPUs  and  thus 
enhance  its  performance  by  means  of  both  vectorization  and  multitasking. 

Further  detail  concerning  this  strategy  is  provided  in  [8]. 


4.  Parallel  Processing  Considerations 

Parallel  processing  may  be  viewed  in  terms  of  a  collection  of  separately-running 
programs,  called  processes,  which  exchange  information  among  themselves  by  means  of 
some  interconnection  scheme.  The  effects  of  load  balancing,  granularity,  overhead,  and 
Amdahl’s  law  are  all  important  factors  affecting  the  performance  of  parallel  computers. 

Amdahl’s  law  [9]  points  out  that  if  a  computer  has  two  speeds  of  operation,  the  slower 
mode  will  dominate  performance  as  the  faster  mode  becomes  arbitrarily  fast.  This  can  be 
expressed  by  the  relation 


$$S = \left[ R + \frac{1 - R}{N} \right]^{-1}$$

where S is the maximum speedup achievable by using N processors on an application
which has a fraction of code, R, which must be executed in sequential mode. For
example, if R is assumed to be 0.02, the maximum speedup attainable on a four-processor
machine will be 3.77.
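
The quoted figure can be checked with a few lines of code. This is a sketch of Amdahl's relation written as S = 1/(R + (1 - R)/N); the function name `amdahl_speedup` is a label chosen here for illustration.

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Maximum speedup when a fraction of the work must run sequentially."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


s4 = amdahl_speedup(0.02, 4)
print(f"{s4:.2f}")  # 3.77
```

As N grows without bound the speedup approaches 1/R, which is 50 for R = 0.02, so even a small sequential fraction caps the benefit of adding processors.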

The effects on speedup of granularity and overhead, which are also important
considerations  in  multitasking,  may  be  illustrated  as  follows: 

$$S = G \left( O + \frac{G}{N} \right)^{-1}$$

Granularity,  G,  is  defined  as  the  length  of  time  required  to  execute  some  code  segment  on 
a  single  processor.  Although  smaller-grained  tasks  are  generally  easier  to  create  than 
larger  ones,  the  overhead,  O,  associated  with  creation,  synchronization,  etc.,  may  negate 
any  performance  gains  which  would  otherwise  result  from  parallel  execution  of  smaller 
tasks.  To  attain  a  speedup  of  3.77  using  four  processors  would,  for  example,  require  that 
the  granularity  be  more  than  65  times  the  overhead. 
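
The factor of 65 follows from solving S = G/(O + G/N) for the ratio G/O, which gives G/O = S/(1 - S/N). The sketch below evaluates both forms; the function names are hypothetical and chosen only for this example.

```python
def overhead_speedup(granularity, overhead, n_processors):
    """Speedup of a task of length G split over N processors with overhead O."""
    return granularity / (overhead + granularity / n_processors)


def required_granularity_ratio(target_speedup, n_processors):
    """Ratio G/O needed to reach a target speedup (requires S < N)."""
    return target_speedup / (1.0 - target_speedup / n_processors)


ratio = required_granularity_ratio(3.77, 4)
print(f"G/O = {ratio:.1f}")
```

For S = 3.77 and N = 4 the ratio evaluates to roughly 65.6, consistent with the statement that the granularity must exceed 65 times the overhead.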

A  third  concern  in  multitasking  is  load  balancing,  or  the  distribution  of  computational 
work  across  some  number  of  processors.  Static  and  dynamic  partitioning  may  be 
employed  to  try  to  keep  all  processors  equally  busy.  Static  partitioning  is  most  effective 
when tasks of equal work can be defined a priori; dynamic partitioning may enable better
load  balancing  for  tasks  of  varying  length,  if  the  additional  synchronization  overhead 
incurred  is  not  too  great. 
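
The trade-off between the two partitioning strategies can be illustrated with a toy timing model. This is a sketch under stated assumptions, not a model of the Cray multitasking libraries: static partitioning assigns contiguous blocks of tasks in advance, dynamic partitioning hands the next task to whichever processor becomes idle first, and `sync_overhead` is a hypothetical fixed cost charged per dynamically scheduled task.

```python
import heapq


def static_makespan(task_times, n_processors):
    """Finish time when tasks are split a priori into contiguous blocks."""
    n = len(task_times)
    block = -(-n // n_processors)  # ceiling division
    loads = [sum(task_times[i:i + block]) for i in range(0, n, block)]
    return max(loads)


def dynamic_makespan(task_times, n_processors, sync_overhead=0.0):
    """Finish time when each idle processor takes the next queued task."""
    finish = [0.0] * n_processors
    heapq.heapify(finish)
    for t in task_times:
        earliest = heapq.heappop(finish)  # processor that frees up first
        heapq.heappush(finish, earliest + t + sync_overhead)
    return max(finish)


tasks = [4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0]  # uneven task lengths
print(static_makespan(tasks, 2))         # 16.0
print(dynamic_makespan(tasks, 2))        # 10.0
print(dynamic_makespan(tasks, 2, 2.0))   # 18.0
```

With uneven tasks and no overhead, dynamic scheduling finishes sooner than the static split; once the per-task synchronization cost is large enough, the advantage is lost.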

Extensions  to  programming  languages  which  allow  the  creation  and  termination  of 
processes,  synchronization  of  processes,  and  communication  among  them  are  necessary  for 
multiprocessing.  Both  the  Cray-2  and  the  Cray  X-MP  have  software  libraries  that 
provide such multitasking tools. Two variations of multitasking are available on the
X-MP, namely, macrotasking and microtasking. Macrotasking is intended for application to
large-grained  problem  partitioning,  while  microtasking,  by  virtue  of  its  very  low 
overhead,  may  be  used  efficiently  at  a  fine-grained  level.  Only  macrotasking  is  available 
on  the  Cray-2.  For  the  current  application,  both  types  of  multitasking  have  been 
examined  and  tested.  Microtasking  is  easily  implemented  in  a  code  which  has  been 
optimized  for  vectorization.  Macrotasking,  however,  requires  careful  examination  of  the 
problem  in  order  to  define  large  code  structures  for  parallel  execution.  More  detailed 
discussions of these multitasking concepts and others may be found in Larson [10],
Johnson [11], and Misegades et al. [12].

5.  Test  Problem


The  geometry  of  the  three-dimensional  model  problem  is  representative  of  a 
turbomachinery  application  and  consists  of  a  rectilinear  cascade  of  finite-span,  swept 
blades  mounted  between  endwalls.  Test  cases  include  inviscid  subsonic  flows  and 
transonic  flows  with  shocks,  and  viscous  laminar  and  turbulent  flows  for  Reynolds 
numbers  ranging  from  8.4  x  10  to  2.0  x  10  (based  on  cascade  gap  and  critical  speed). 
We  believe  internal  flow  problems  to  be  more  challenging  than  external  problems  for  a 
number  of  reasons.  These  include  the  fact  that  internal  problems  limit  one’s  ability  to  use 
grid  stretching,  and  that  lateral  solid  boundaries  slow  convergence  by  only  letting 
transients  propagate  out  the  inlet  and  exit  rather  than  radiating  to  infinity  in  all 
directions. 

The  computational  domain  is  partitioned  by  the  collection  of  embedded  meshes. 
Three  levels  of  grid  fineness  are  used  in  the  present  application.  If  the  coarsest  mesh  (grid 
3)  is  thought  of  as  covering  the  whole  domain,  the  embedded  meshes  are  then  formed  by 
halving  or  quartering  the  grid  spacing  in  selected  regions.  Grid  1  refers  to  the  finest  mesh, 
which  lies  along  the  juncture  between  the  blade  and  the  endwalls.  Grid  2,  coarser  than 
grid 1 by a factor of two, encompasses all surfaces not in grid 1, i.e., the blade and wall
surfaces  away  from  the  corners.  Any  coarse-grid  points  underlying  the  finer  grids  are 
coincident  with  points  on  those  grids.  The  intergrid  boundaries  are  treated  by 
overlapping  the  grids  such  that  the  boundary  of  an  embedded  mesh  lies  on  the  interior  of 
one  of  its  neighboring  meshes,  with  interpolation  used  where  necessary  to  fill  in  the 
surfaces. 
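
The coincidence of coarse and fine grid points can be seen in a one-dimensional sketch. Node positions are stored as integers in units of the finest spacing, so halving and quartering correspond to strides of 2 and 1 against a coarse stride of 4. The extents of the embedded regions (0 to 16 and 0 to 8) are arbitrary values assumed for illustration and are not taken from the test problem.

```python
def grid_points(start, stop, spacing):
    """Node positions in units of the finest grid spacing."""
    return list(range(start, stop + 1, spacing))


def underlying(coarse, fine):
    """Coarse-grid nodes that fall inside the span of a finer grid."""
    return [x for x in coarse if fine[0] <= x <= fine[-1]]


grid3 = grid_points(0, 32, 4)  # global coarse mesh
grid2 = grid_points(0, 16, 2)  # embedded mesh, spacing halved
grid1 = grid_points(0, 8, 1)   # embedded mesh, spacing quartered

# Every coarse node beneath a finer grid is also a node of that finer grid.
print(set(underlying(grid3, grid2)) <= set(grid2))
print(set(underlying(grid2, grid1)) <= set(grid1))
```

Because refinement is by factors of two, no interpolation is needed to transfer values at these shared nodes; interpolation is only required at the overlapping intergrid boundaries.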

When  the  set  of  three  grids  described  above  is  used,  the  full  Navier-Stokes  equations 
are  solved  on  mesh  1,  the  thin-layer  Navier-Stokes  equations  are  solved  on  mesh  2  and  the 
Euler  equations  are  solved  on  the  coarsest  mesh,  mesh  3.  The  flowfield  updating  begins 
with mesh 1. After one timestep on mesh 1, mesh 2 is updated exterior to mesh 1 while
convergence  acceleration  is  applied  at  the  mesh-2  points  interior  to  mesh  1.  Next,  mesh  3 
is  updated  exterior  to  mesh  2  while  convergence  acceleration  is  applied  at  the  mesh-3 
points  interior  to  mesh  2.  Updating  proceeds  in  this  fashion  until  the  global  mesh  has 
been  advanced  by  one  timestep.  Then  the  updating  cycle  is  completed  by  applying 
convergence  acceleration  to  coarsenings  of  the  global  grid.  This  cycle  is  repeated  until  the 
desired  measure  of  convergence  is  satisfied. 
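
The ordering of one updating cycle can be traced with a small sketch. It only lists the operations in the sequence described above and performs no flow computation. The function name `update_cycle`, its arguments and the default of two global coarsenings are hypothetical; the text does not state how many coarsenings of the global grid are used.

```python
def update_cycle(n_meshes=3, n_coarsenings=2):
    """Return the operations of one cycle as (action, grid, region) tuples."""
    steps = [("timestep", "mesh 1", "all points")]
    for m in range(2, n_meshes + 1):
        # The text describes these two operations as happening together.
        steps.append(("timestep", f"mesh {m}", f"exterior to mesh {m - 1}"))
        steps.append(("accelerate", f"mesh {m}", f"interior to mesh {m - 1}"))
    for k in range(1, n_coarsenings + 1):
        steps.append(("accelerate", f"coarsening {k} of global mesh", "all points"))
    return steps


cycle = update_cycle()
for action, grid, region in cycle:
    print(f"{action:10s} {grid:30s} {region}")
```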

Performance of parallel computers is evaluated by comparing wall clock time for
uniprocessed and multiprocessed runs on a dedicated machine. The ratio of the former to
the latter is called the speedup. Dividing the speedup by the number of processors gives
the efficiency of processor utilization, a measure of load balancing and overhead.
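
These two definitions are simple enough to state as code. The sketch below uses hypothetical function names and, as input, the four-processor Cray-2 speedup of 2.58 reported in Table I; the wall clock times in the second example are invented solely to show the ratio.

```python
def speedup(uniprocessor_time, multiprocessor_time):
    """Ratio of single-processor to multiprocessor wall clock time."""
    return uniprocessor_time / multiprocessor_time


def efficiency(speedup_value, n_processors):
    """Fraction of the ideal N-fold speedup that was actually achieved."""
    return speedup_value / n_processors


# Table I, Cray-2, four processors: speedup 2.58.
print(efficiency(2.58, 4))

# Hypothetical timings: 100 s on one processor, 40 s on four.
print(efficiency(speedup(100.0, 40.0), 4))
```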

A  two-dimensional  version  of  the  code  makes  extensive  use  of  parallelism  inherent  in 
the  physical  problem  to  obtain  a  good  load  balance  when  using  the  X-MP  macrotasking 
software.  As  the  macrotasking  approach  requires  the  use  of  calls  to  a  subroutine  library, 
it  has  rather  high  overhead  and  thus  yields  best  results  for  large-grained  code  segments. 
Table  I  presents  some  two-dimensional  macrotasking  results  from  both  a  Cray  X-MP/48 
and  a  Cray-2,  for  the  basic  solver  with  multigrid.  X-MP  performance  shows  that  the 
algorithm has been efficiently parallelized, while the poorer performance on the Cray-2 is
due  to  less  mature  macrotasking  software  on  that  machine.  In  Table  II,  macrotasking 
results  for  the  basic  solver  are  shown  and  contrasted  with  the  same  code  run  using  the 
microtasking  approach.  Microtasking  is  managed  within  CPUs,  through  the  use  of  the 
X-MP  cluster  registers.  The  very  low  overhead  attained  by  microtasking  allows  users  to 
partition  code  at  a  fine-grained  level  while  still  making  efficient  use  of  two  or  more  CPUs. 
The  microtasking  results  in  Table  II  are  only  marginally  better  than  the  macrotasking 
ones  because  the  algorithm  employed  in  the  test  had  been  restructured  to  maximize  task 
granularity.  Three-dimensional  microtasking  results  for  a  small-grained  partitioning  of 
the  algorithm  are  shown  in  Table  III. 

Comparison  of  the  embedded-mesh  algorithm  with  a  single-mesh  algorithm  yields  the 
following  general  conclusion:  the  accuracy  of  the  embedded-mesh  results  is  essentially 
that  of  a  global  finest  mesh,  while  the  convergence  rate  is  like  that  of  a  global  coarsest 
mesh.  In  two-dimensional  computations,  using  the  Euler  and  thin-layer  Navier-Stokes 
equations  and  three  mesh  regions,  embedding  speedups  as  high  as  30  in  comparison  to  a 
single-mesh  algorithm  have  been  obtained  (see  Table  IV).  A  three-dimensional  algorithm 
has  been  designed  and  implemented.  As  shown  in  Table  V,  results  have  been  obtained  for 
simple  embeddings  which  span  the  y  direction,  using  relatively  coarse  grids  with  no 
tuning.  These  results  are  consistent  with  their  two-dimensional  analogs. 

7.  Conclusions

A procedure for solving complex three-dimensional aerodynamic flows on
parallel-processing supercomputers has been presented. This procedure incorporates a number of
innovations  in  order  to  attain  high  levels  of  computational  efficiency.  These  innovations 
include:  locally-embedded  mesh  refinements,  a  zonal  flow  simulation  strategy  that  solves 
the  Euler  equations  through  the  full  Navier-Stokes  equations,  multigrid  convergence 
acceleration  applied  to  a  robust  explicit  basic  flow  solver,  and  both  vectorization  and 
multitasking. 

Computations  have  been  carried  out  on  parallel-processing  supercomputers, 
principally  on  the  Cray-2,  but  also  on  the  Cray  X-MP  because  of  its  more  sophisticated 
multitasking software. The results presented here illustrate that a four-CPU
shared-memory multiprocessor can be used to carry out aerodynamics simulations with a high
degree  of  efficiency. 

The embedded grid scheme has demonstrated speedups as high as 30 relative to a global
fine-grid solution, while maintaining the accuracy of the fine-grid solution.

Tables


Table I.  Two-Dimensional Macrotasked Multigridded Scheme Performance

               2 Processors            4 Processors
Machine        Speedup   Efficiency    Speedup   Efficiency
Cray X-MP      1.87      0.94          3.30      0.83
Cray-2         1.80      0.90          2.58      0.65

Table II.  Two-Dimensional Multitasked Basic Solver Performance

                               2 Processors            4 Processors
Machine                        Speedup   Efficiency    Speedup   Efficiency
Cray X-MP with macrotasking    1.91      0.96          3.58      0.90
Cray X-MP with microtasking    1.93      0.97          3.78      0.95

Table III.  Three-Dimensional Microtasked Multigridded Scheme Performance

               2 Processors            3 Processors            4 Processors
Machine        Speedup   Efficiency    Speedup   Efficiency    Speedup       Efficiency
Cray X-MP      1.96      0.98          2.83      0.94          [illegible]   0.89

Table IV.  Two-Dimensional Embedding Speedups
(129 x 33 x 33 finest grid, 2 embeddings)

Test Case                 Speedup
Inviscid Subcritical      16.4
Inviscid Supercritical    6.1
Turbulent Viscous         30.2

Table V.  Three-Dimensional Embedding Speedups
(65 x 17 x 17 finest grid, 1 embedding)

Test Case                 Speedup
Inviscid Subcritical      7.0
Inviscid Supercritical    4.6


